138 research outputs found
On Anonymizing the Provenance of Collection-Based Workflows
We examine in this paper the problem of anonymizing the provenance of collection-oriented workflows, in which the constituent modules use and generate sets of data records. Despite their popularity, this kind of workflow has been overlooked in the literature with respect to privacy. We therefore set out to examine the following questions: How can the provenance of a collection-based module be anonymized? Can lineage information be preserved? Beyond a single module, how can the provenance of a whole workflow be anonymized? As well as addressing the above questions, we report on evaluation exercises that assess the effectiveness and efficiency of our solution. In particular, we tease apart the parameters that impact the quality of the obtained anonymized provenance information.
Automatic vs Manual Provenance Abstractions: Mind the Gap
In recent years the need to simplify or to hide sensitive information in
provenance has given way to research on provenance abstraction. In the context
of scientific workflows, existing research provides techniques to
semi-automatically create abstractions of a given workflow description, which are in
turn used as filters over the workflow's provenance traces. An alternative
approach that is commonly adopted by scientists is to build workflows with
abstractions embedded into the workflow's design, such as using sub-workflows.
This paper reports on the comparison of manual versus semi-automated approaches
in a context where result abstractions are used to filter report-worthy results
of computational scientific analyses. Specifically, we take a real-world
workflow containing user-created design abstractions and compare these with
abstractions created by ZOOM UserViews and Workflow Summaries systems. Our
comparison shows that semi-automatic and manual approaches largely overlap from
a process perspective; however, there is a dramatic mismatch in terms of data
artefacts retained in an abstracted account of derivation. We discuss reasons
and suggest future research directions.
Comment: Preprint accepted to the 2016 workshop on the Theory and Applications
of Provenance, TAPP 201
SHARP: Harmonizing Galaxy and Taverna workflow provenance
SHARP is a Linked Data approach for harmonizing cross-workflow provenance. In this demo, we demonstrate SHARP through a real-world omic experiment involving workflow traces generated by the Taverna and Galaxy systems. SHARP starts by interlinking provenance traces generated by Galaxy and Taverna workflows and then harmonizes the interlinked graphs using OWL and PROV inference rules. The resulting provenance graph can be exploited for answering queries across Galaxy and Taverna workflow runs.
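As a rough illustration of the kind of PROV inference mentioned above, the following sketch applies one rule from the W3C PROV constraints (if an entity was generated by one activity and used by another, the second activity was informed by the first) to two small traces. It is not SHARP's implementation: the record types, the function name and the sample identifiers are hypothetical, and it assumes the interlinking step has already given shared entities the same identifier in both traces.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Generation(NamedTuple):
    """wasGeneratedBy(entity, activity) -- hypothetical record type."""
    entity: str
    activity: str


class Usage(NamedTuple):
    """used(activity, entity) -- hypothetical record type."""
    activity: str
    entity: str


def infer_was_informed_by(
    generations: Iterable[Generation], usages: Iterable[Usage]
) -> Set[Tuple[str, str]]:
    """Return (informed, informant) activity pairs.

    Rule: wasGeneratedBy(e, a1) and used(a2, e) imply wasInformedBy(a2, a1).
    """
    # Index the activities that produced each entity.
    producers = {}
    for g in generations:
        producers.setdefault(g.entity, set()).add(g.activity)

    inferred = set()
    for u in usages:
        for producer in producers.get(u.entity, ()):
            inferred.add((u.activity, producer))
    return inferred


# Hypothetical traces: a Galaxy run produces a file that a Taverna run consumes.
galaxy_generations = [Generation("variants.vcf", "galaxy:run1")]
taverna_usages = [Usage("taverna:run7", "variants.vcf")]

links = infer_was_informed_by(galaxy_generations, taverna_usages)
```

With links of this kind materialized, a query about a Taverna run can reach the Galaxy run that produced its inputs.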
PAV ontology: provenance, authoring and versioning
Provenance is a critical ingredient for establishing trust of published
scientific content. This is true whether we are considering a data set, a
computational workflow, a peer-reviewed publication or a simple scientific
claim with supportive evidence. Existing vocabularies such as DC Terms and the
W3C PROV-O are domain-independent and general-purpose, and they allow and
encourage extensions to cover more specific needs. We identify the specific
need for identifying or distinguishing between the various roles assumed by
agents manipulating digital artifacts, such as author, contributor and curator.
We present the Provenance, Authoring and Versioning ontology (PAV): a
lightweight ontology for capturing just enough descriptions essential for
tracking the provenance, authoring and versioning of web resources. We argue
that such descriptions are essential for digital scientific content. PAV
distinguishes between contributors, authors and curators of content and
creators of representations in addition to the provenance of originating
resources that have been accessed, transformed and consumed. We explore five
projects (and communities) that have adopted PAV, illustrating their usage
through concrete examples. Moreover, we present mappings that show how PAV
extends the PROV-O ontology to support broader interoperability.
The authors strived to keep PAV lightweight and compact by including only
those terms that have been demonstrated to be pragmatically useful in existing
applications, and by recommending terms from existing ontologies when
plausible.
We analyze and compare PAV with related approaches, namely Provenance
Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their
differences with PAV, outlining strengths and weaknesses of our proposed model.
We specify SKOS mappings that align PAV with DC Terms.
Comment: 22 pages (incl 5 tables and 19 figures). Submitted to Journal of
Biomedical Semantics 2013-04-26 (#1858276535979415). Revised article
submitted 2013-08-30. Second revised article submitted 2013-10-06. Accepted
2013-10-07. Author proofs sent 2013-10-09 and 2013-10-16. Published
2013-11-22. Final version 2013-12-06.
http://www.jbiomedsem.com/content/4/1/3
Efficient Feedback Collection for Pay-as-you-go Source Selection
Article No. 1. Technical developments, such as the web of data and web data extraction, combined with policy developments such as those relating to open government or open science, are leading to the availability of increasing numbers of data sources. Indeed, given these physical sources, it is then also possible to create further virtual sources that integrate, aggregate or summarise the data from the original sources. As a result, there is a plethora of data sources, from which a small subset may be able to provide the information required to support a task. The number and rate of change of the available sources are likely to make manual source selection and curation by experts impractical for many applications, leading to the need to pursue a pay-as-you-go approach, in which crowds or data consumers annotate results based on their correctness or suitability, with the resulting annotations used to inform, e.g., source selection algorithms. However, for pay-as-you-go feedback collection to be cost-effective, it may be necessary to select judiciously the data items on which feedback is to be obtained. This paper describes OLBP (Ordering and Labelling By Precision), a heuristics-based approach to the targeting of data items for feedback to support mapping and source selection tasks, where users express their preferences in terms of the trade-off between precision and recall. The proposed approach is then evaluated on two different scenarios: mapping selection with synthetic data, and source selection with real data produced by web data extraction. The results demonstrate a significant reduction in the amount of feedback required to reach user-provided objectives when using OLBP.
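To make the pay-as-you-go idea concrete, the sketch below shows one simple way correctness annotations could feed a source selection step: estimate each source's precision from the feedback collected so far, then keep the sources that meet a user-provided precision threshold. This is not OLBP, which concerns choosing which data items to request feedback on. The function names, the threshold and the sample feedback are all hypothetical.

```python
from typing import Dict, List, Optional


def estimate_precision(feedback: List[bool]) -> Optional[float]:
    """Fraction of annotated items marked correct; None if no feedback yet."""
    if not feedback:
        return None
    return sum(feedback) / len(feedback)


def select_sources(
    feedback_by_source: Dict[str, List[bool]], min_precision: float
) -> List[str]:
    """Return the sources whose estimated precision meets the threshold."""
    selected = []
    for source, feedback in feedback_by_source.items():
        precision = estimate_precision(feedback)
        # Sources with no feedback are skipped rather than guessed at.
        if precision is not None and precision >= min_precision:
            selected.append(source)
    return sorted(selected)


# Hypothetical annotations: True means the item was judged correct.
feedback = {
    "src_a": [True, True, False, True],
    "src_b": [False, False, True],
    "src_c": [],
}

chosen = select_sources(feedback, min_precision=0.7)
```

Because the estimates only improve as more items are annotated, the choice of which items to annotate next determines how quickly such a selection stabilises, which is the cost the abstract is concerned with.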
Privacy-preserving data analysis workflows for eScience
©2019 Copyright held by the author(s). Computing-intensive experiments in modern sciences have become increasingly data-driven, perfectly illustrating the challenges of the Big Data era. These experiments are usually specified and enacted in the form of workflows that need to manage (i.e., read, write, store, and retrieve) sensitive data such as persons’ past diseases and treatments. While there is an active body of research on how to protect sensitive data by, for instance, anonymizing datasets, there is a limited number of approaches that assist scientists in identifying the datasets generated by the workflows that need to be anonymized, and in setting the anonymization degree that must be met. We present in this paper a preliminary approach for setting and inferring anonymization requirements of datasets used and generated by a workflow execution. The approach was implemented and showcased using a concrete example, and its efficiency was assessed through validation exercises.
The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web
Research in life sciences is increasingly being conducted in a digital and
online environment. In particular, life scientists have been pioneers in
embracing new computational tools to conduct their investigations. To support
the sharing of digital objects produced during such research investigations, we
have witnessed in the last few years the emergence of specialized repositories,
e.g., DataVerse and FigShare. Such repositories provide users with the means to
share and publish datasets that were used or generated in research
investigations. While these repositories have proven their usefulness,
interpreting and reusing evidence for most research results is a challenging
task. Additional contextual descriptions are needed to understand how those
results were generated and/or the circumstances under which they were
concluded. Because of this, scientists are calling for models that go beyond
the publication of datasets to systematically capture the life cycle of
scientific investigations and provide a single entry point to access the
information about the hypothesis investigated, the datasets used, the
experiments carried out, the results of the experiments, the people involved in
the research, etc. In this paper we present the Research Object (RO) suite of
ontologies, which provide a structured container to encapsulate research data
and methods along with essential metadata descriptions. Research Objects are
portable units that enable the sharing, preservation, interpretation and reuse
of research investigation results. The ontologies we present have been designed
in the light of requirements that we gathered from life scientists. They have
been built upon existing popular vocabularies to facilitate interoperability.
Furthermore, we have developed tools to support the creation and sharing of
Research Objects, thereby promoting and facilitating their adoption.
Comment: 20 page